The MULTEXT-East corpus
نویسندگان
چکیده
The EU MULTEXT-East project has produced harmonised language resources for Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In this paper we introduce the MULTEXT-East multilingual corpus, which comprises marked-up texts in the six languages totaling approximately 2 million words and a small speech corpus. The corpus is encoded in SGML, in the TEI-like Corpus Encoding Specification and is divided into a parallel and a comparable (fiction, news) part. The parallel corpus consists of the novel "1984" by George Orwell in the English original and translations. The translations are sentence aligned with the original and tagged for word-level linguistic information, i.e. for morphosyntactic descriptions and lemmas. Detailed information on the corpus is available on the WWW and the corpus itself has been released for research purposes on a CD-ROM in the scope of the TELRI concerted action.
منابع مشابه
Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages
The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell’s Nineteen Eighty-Four, with versions in all six languages...
متن کاملMULTEXT-East Resources for Serbian
The paper presents the MULTEXT-East language resources for the Serbian language. MULTEXT-East is a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe wordlevel s...
متن کاملPersian in MULTEXT-East Framework
Farsi, also known as Persian, is the official language of Iran, Tajikistan and one of the two main languages spoken in Afghanistan. It is an Indo-European agglutinating language, written in Arabic script. This paper presents the first step in creating Farsi basic language resources kit. This Step comprises the specifications for morphosyntactic encoding, which is based on the EAGLES/MULTEXT mod...
متن کاملOWL/DL formalization of the MULTEXT-East morphosyntactic specifications
This paper describes the modeling of the morphosyntactic annotations of the MULTEXT-East corpora and lexicons as an OWL/DL ontology. Formalizing annotation schemes in OWL/DL has the advantages of enabling formally specifying interrelationships between the various features and making logical inferences based on the relationships between them. We show that this approach provides us with a top-dow...
متن کاملMULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level syntactic annotation...
متن کامل